library(tidyverse)

Interactive plot

Interactive plot using the fifa18 dataset: here we examine whether age group has any effect on a player's potential, shot power, and passing ability.

The fifa18 dataset contains 17,076 rows and 40 columns describing player attributes such as stamina, potential, shooting, passing, aggression, and balance.

fifa <- read_csv("C:/Susmitha Chereddy/Data_visualization/Mini_project_2_chereddy/data/fifa18.csv")
## Rows: 17076 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): name, nationality, club
## dbl (37): age, overall, potential, acceleration, aggression, agility, balanc...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fifa
## # A tibble: 17,076 × 40
##    name        nationality club    age overall potential acceleration aggression
##    <chr>       <chr>       <chr> <dbl>   <dbl>     <dbl>        <dbl>      <dbl>
##  1 Cristiano … Portugal    Real…    32      94        94           89         63
##  2 L. Messi    Argentina   FC B…    30      93        93           92         48
##  3 Neymar      Brazil      Pari…    25      92        94           94         56
##  4 L. Suárez   Uruguay     FC B…    30      92        92           88         78
##  5 M. Neuer    Germany     FC B…    31      92        92           58         29
##  6 R. Lewando… Poland      FC B…    28      91        91           79         80
##  7 De Gea      Spain       Manc…    26      90        92           57         38
##  8 E. Hazard   Belgium     Chel…    26      90        91           93         54
##  9 T. Kroos    Germany     Real…    27      90        90           60         60
## 10 G. Higuaín  Argentina   Juve…    29      90        90           78         50
## # … with 17,066 more rows, and 32 more variables: agility <dbl>, balance <dbl>,
## #   ball_control <dbl>, composure <dbl>, crossing <dbl>, curve <dbl>,
## #   dribbling <dbl>, finishing <dbl>, free_kick_accuracy <dbl>,
## #   gk_diving <dbl>, gk_handling <dbl>, gk_kicking <dbl>, gk_positioning <dbl>,
## #   gk_reflexes <dbl>, heading_accuracy <dbl>, interceptions <dbl>,
## #   jumping <dbl>, long_passing <dbl>, long_shots <dbl>, marking <dbl>,
## #   penalties <dbl>, positioning <dbl>, reactions <dbl>, short_passing <dbl>, …
summary(fifa$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   21.00   25.00   25.11   28.00   47.00
fifa_group <- fifa %>%
  mutate(
    # Create categories
    age_group = dplyr::case_when(
      age <= 21            ~ "very_young",
      age > 21 & age <= 25 ~ "young",
      age > 25 & age <= 28 ~ "prime",
      age > 28             ~ "experienced"
    ),
    # Convert to factor so the groups keep their age order in legends
    age_group = factor(age_group,
                       levels = c("very_young", "young", "prime", "experienced"))
  )

fifa_group
## # A tibble: 17,076 × 41
##    name        nationality club    age overall potential acceleration aggression
##    <chr>       <chr>       <chr> <dbl>   <dbl>     <dbl>        <dbl>      <dbl>
##  1 Cristiano … Portugal    Real…    32      94        94           89         63
##  2 L. Messi    Argentina   FC B…    30      93        93           92         48
##  3 Neymar      Brazil      Pari…    25      92        94           94         56
##  4 L. Suárez   Uruguay     FC B…    30      92        92           88         78
##  5 M. Neuer    Germany     FC B…    31      92        92           58         29
##  6 R. Lewando… Poland      FC B…    28      91        91           79         80
##  7 De Gea      Spain       Manc…    26      90        92           57         38
##  8 E. Hazard   Belgium     Chel…    26      90        91           93         54
##  9 T. Kroos    Germany     Real…    27      90        90           60         60
## 10 G. Higuaín  Argentina   Juve…    29      90        90           78         50
## # … with 17,066 more rows, and 33 more variables: agility <dbl>, balance <dbl>,
## #   ball_control <dbl>, composure <dbl>, crossing <dbl>, curve <dbl>,
## #   dribbling <dbl>, finishing <dbl>, free_kick_accuracy <dbl>,
## #   gk_diving <dbl>, gk_handling <dbl>, gk_kicking <dbl>, gk_positioning <dbl>,
## #   gk_reflexes <dbl>, heading_accuracy <dbl>, interceptions <dbl>,
## #   jumping <dbl>, long_passing <dbl>, long_shots <dbl>, marking <dbl>,
## #   penalties <dbl>, positioning <dbl>, reactions <dbl>, short_passing <dbl>, …
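The same binning can also be sketched with base R's `cut()`, which assigns the groups and builds the factor in one step. This is only a minimal alternative, assuming the same cut points (21, 25, 28) as the `case_when()` call above; the `ages` vector is made-up illustration data, not taken from fifa18.

```r
# Hypothetical ages for illustration only (not from the fifa18 data)
ages <- c(16, 21, 22, 25, 26, 28, 29, 47)

# right = TRUE (the default) closes each interval on the right,
# which matches the `<=` comparisons used in case_when()
age_group <- cut(
  ages,
  breaks = c(-Inf, 21, 25, 28, Inf),
  labels = c("very_young", "young", "prime", "experienced")
)

# Count how many players fall in each group
table(age_group)
```

With the real data, `ages` would be replaced by `fifa$age`.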

Visualizing relationship between shot_power and long_passing

library(viridis)
## Loading required package: viridisLite
my_fifa_plot_1 <- ggplot(data = fifa_group) +
 geom_point(aes(x = shot_power, y = long_passing,
                color=age_group), alpha = 0.5)+
 scale_color_viridis(discrete = TRUE)+
 scale_x_log10() +
 labs(title = "Relationship between shot_power and long_passing",
        subtitle = "fifa dataset",
        caption = "Data source: reisanar/datasets")+
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

# A single x/y position draws the label once instead of twice on the same spot
fifa_plot <- my_fifa_plot_1 +
  annotate("text", x = 50, y = 50,
           label = "Long-pass ~ shot power", color = "white",
           size = 4, angle = 45, fontface = "bold")
fifa_plot

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(fifa_plot)

Visualizing relationship between potential and standing_tackle

library(viridis)
my_fifa_plot_2 <- ggplot(data = fifa_group) +
 geom_point(aes(x = potential, y = standing_tackle,
                color=age_group), alpha = 0.5)+
 scale_color_viridis(discrete = TRUE)+
 scale_x_log10() +
  labs(title = "Relationship between potential and standing_tackle",
        subtitle = "fifa dataset",
        caption = "Data source: reisanar/datasets")+
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

my_fifa_plot_2

library(plotly)
ggplotly(my_fifa_plot_2)

Spatial visualization

Spatial visualization using the Florida Lakes dataset: we plot the lakes in the state of Florida, and in Polk County in particular, to see whether the name Lakeland has any relation to the number of lakes there. The Florida Lakes dataset contains 4,243 rows and 7 columns: 1. PERIMETER 2. NAME 3. COUNTY 4. OBJECTID 5. SHAPEAREA 6. SHAPELEN 7. geometry

library(sf)
## Linking to GEOS 3.9.1, GDAL 3.3.2, PROJ 7.2.1; sf_use_s2() is TRUE
florida_shapes <- read_sf("C:/Susmitha Chereddy/Data_visualization/Mini_project_2_chereddy/data/Florida_Lakes/Florida_Lakes/Florida_Lakes.shp") 
florida_shapes
## Simple feature collection with 4243 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.42774 ymin: 25.02625 xmax: -80.03097 ymax: 31.00254
## Geodetic CRS:  WGS 84
## # A tibble: 4,243 × 7
##    PERIMETER NAME   COUNTY OBJECTID SHAPEAREA SHAPELEN                  geometry
##        <dbl> <chr>  <chr>     <int>     <dbl>    <dbl>        <MULTIPOLYGON [°]>
##  1    11082. Lake … ORANGE        1  1818000.   11082. (((-81.34813 28.62354, -…
##  2     2834. Black… ESCAM…        2    31380.    2834. (((-87.42029 30.49087, -…
##  3    18768. Lake … HIGHL…        3 13601177.   18768. (((-81.4614 27.46472, -8…
##  4      493. Halfm… ESCAM…        4     6337.     493. (((-87.3131 30.74034, -8…
##  5     5663. Cresc… ESCAM…        5   338242.    5663. (((-87.27591 30.4692, -8…
##  6      317. Black… SANTA…        6     2380.     317. (((-87.26869 30.69546, -…
##  7      181. Beave… ESCAM…        7     1381.     181. (((-87.27064 30.70558, -…
##  8     1376. Salte… ESCAM…        8    24421.    1376. (((-87.26273 30.94937, -…
##  9     1914. Forty… SANTA…        9   178663.    1914. (((-87.18693 30.81357, -…
## 10      328. Hutso… SANTA…       10     7838.     328. (((-87.14079 30.96851, -…
## # … with 4,233 more rows

Visualizing all lakes in Florida

ggplot() +
  geom_sf(data = florida_shapes, aes(fill = SHAPEAREA),
          color = "black", size = 0.15) +
  # One fill scale only: a second scale_fill_*() call would replace the first
  scale_fill_gradient(low = "darkblue", high = "blue",
                      labels = scales::comma,
                      guide = "colorbar", na.value = "darkgrey") +
  theme(legend.position = "right") +
  labs(title = "Map of All Lakes in state of Florida",
       subtitle = "Lakes shown in blue",
       caption = "Data source: reisanar/datasets") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

Visualizing lakes in Orange county

florida_shapes %>%
  filter(COUNTY == "ORANGE") %>%
  ggplot() +
  geom_sf(aes(fill = SHAPEAREA),
          color = "black", size = 0.15) +
  # One fill scale only: a second scale_fill_*() call would replace the first
  scale_fill_gradient(low = "navyblue", high = "blue",
                      labels = scales::comma,
                      guide = "colorbar", na.value = "darkgrey") +
  theme(legend.position = "right") +
  labs(title = "Map of All Lakes in Orange County",
       subtitle = "Orange County: State of Florida",
       caption = "Data source: reisanar/datasets") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

Visualizing lakes in Polk county

florida_shapes %>%
  filter(COUNTY == "POLK") %>%
  ggplot() +
  geom_sf(aes(fill = SHAPEAREA),
          color = "black", size = 0.15) +
  # One fill scale only: a second scale_fill_*() call would replace the first
  scale_fill_gradient(low = "navyblue", high = "blue",
                      labels = scales::comma,
                      guide = "colorbar", na.value = "darkgrey") +
  theme(legend.position = "right") +
  # Title and subtitle now name the county that is actually filtered
  labs(title = "Map of All Lakes in Polk County",
       subtitle = "Polk County: State of Florida",
       caption = "Data source: reisanar/datasets") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

Visualizing lakes of Florida with Perimeter as fill

ggplot() +
  geom_sf(data = florida_shapes, aes(fill = PERIMETER),
          color = "black", size = 0.15) +
  # One fill scale only: a second scale_fill_*() call would replace the first
  scale_fill_gradient(low = "darkblue", high = "blue",
                      labels = scales::comma,
                      guide = "colorbar", na.value = "darkgrey") +
  theme(legend.position = "right") +
  labs(title = "Perimeter Map of All Lakes in state of Florida",
       subtitle = "Lakes shown in blue",
       caption = "Data source: reisanar/datasets") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

Visualization of a model

Visualization of a model on a housing dataset: here we visualize predicted house prices from the WestRoxbury dataset and see how different variables affect them. The WestRoxbury dataset contains 5,802 rows and 14 columns: 1. Total_Value 2. TAX 3. LOT_SQFT 4. YR BUILT 5. GROSS AREA 6. LIVING AREA 7. FLOORS 8. ROOMS 9. BEDROOMS 10. FULL_BATH 11. HALF_BATH 12. KITCHEN 13. FIREPLACE 14. REMODEL

WestRoxbury <- read_csv("C:/Susmitha Chereddy/Data_visualization/Mini_project_2_chereddy/data/WestRoxbury.csv") %>%
  # Replace spaces in column names with underscores
  rename(Total_Value = `TOTAL VALUE`,
         LOT_SQFT    = `LOT SQFT`,
         GROSS_AREA  = `GROSS AREA`,
         FULL_BATH   = `FULL BATH`,
         LIVING_AREA = `LIVING AREA`,
         HALF_BATH   = `HALF BATH`)
## Rows: 5802 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): REMODEL
## dbl (13): TOTAL VALUE, TAX, LOT SQFT, YR BUILT, GROSS AREA, LIVING AREA, FLO...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
WestRoxbury
## # A tibble: 5,802 × 14
##    Total_Value   TAX LOT_SQFT `YR BUILT` GROSS_AREA LIVING_AREA FLOORS ROOMS
##          <dbl> <dbl>    <dbl>      <dbl>      <dbl>       <dbl>  <dbl> <dbl>
##  1        344.  4330     9965       1880       2436        1352      2     6
##  2        413.  5190     6590       1945       3108        1976      2    10
##  3        330.  4152     7500       1890       2294        1371      2     8
##  4        499.  6272    13773       1957       5032        2608      1     9
##  5        332.  4170     5000       1910       2370        1438      2     7
##  6        337.  4244     5142       1950       2124        1060      1     6
##  7        359.  4521     5000       1954       3220        1916      2     7
##  8        320.  4030    10000       1950       2208        1200      1     6
##  9        334.  4195     6835       1958       2582        1092      1     5
## 10        409.  5150     5093       1900       4818        2992      2     8
## # … with 5,792 more rows, and 6 more variables: BEDROOMS <dbl>,
## #   FULL_BATH <dbl>, HALF_BATH <dbl>, KITCHEN <dbl>, FIREPLACE <dbl>,
## #   REMODEL <chr>
summary(WestRoxbury$FLOORS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.684   2.000   3.000

Linear model Total_Value~LOT_SQFT

ggplot(WestRoxbury, aes(x = LOT_SQFT, y = Total_Value)) +
 geom_point() +
 geom_smooth(method = "lm",formula = "y ~ x") +
 theme_minimal()+
  labs(title = "Relationship between Total House Value and Lot Area",
        subtitle = "Total value in thousands of dollars; lot area in sq ft",
        caption = "Data source: reisanar/datasets")+
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

Linear model Total_Value~GROSS_AREA

ggplot(WestRoxbury, aes(x = GROSS_AREA, y = Total_Value)) +
 geom_point() +
 geom_smooth(method = "lm",formula = "y ~ x") +
 theme_minimal()+
  labs(title = "Relationship between Total House Value and Gross Area",
        subtitle = "Total value in thousands of dollars; gross area in sq ft",
        caption = "Data source: reisanar/datasets")+
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

Linear model Total_Value~LIVING_AREA

ggplot(WestRoxbury, aes(x = LIVING_AREA, y = Total_Value)) +
 geom_point() +
 geom_smooth(method = "lm",formula = "y ~ x") +
 theme_minimal()+
  labs(title = "Relationship between Total House Value and Living Area",
        subtitle = "Total value in thousands of dollars; living area in sq ft",
        caption = "Data source: reisanar/datasets")+
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

Linear Model Total_Value ~ LOT_SQFT + HALF_BATH + FLOORS

library(broom)
house_model <- lm(Total_Value ~ LOT_SQFT + HALF_BATH + FLOORS, data= WestRoxbury)
house_coefs <- tidy(house_model, conf.int = TRUE) %>%
 filter(term != "(Intercept)") # We can typically skip plotting the intercept, so remove it
house_coefs
## # A tibble: 3 × 7
##   term      estimate std.error statistic   p.value conf.low conf.high
##   <chr>        <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 LOT_SQFT    0.0184  0.000343      53.7 0           0.0178    0.0191
## 2 HALF_BATH  29.0     1.80          16.1 4.67e- 57  25.5      32.5   
## 3 FLOORS     88.2     2.15          41.0 2   e-323  84.0      92.4

Plotting the coefficient estimates for the house model (Total_Value ~ LOT_SQFT + HALF_BATH + FLOORS)

ggplot(house_coefs,
 aes(x = estimate,
 y = fct_rev(term))) +
 geom_pointrange(aes(xmin = conf.low,
 xmax = conf.high)) +
 geom_vline(xintercept = 0,
 color = "purple") +
 theme_minimal()+
  labs(title = "Coefficient estimates for the house model (LOT_SQFT, HALF_BATH, FLOORS)",
        subtitle = "Lot area in sq ft; FLOORS and HALF_BATH in levels 1, 2, 3",
        caption = "Data source: reisanar/datasets")+
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

house_new_data <- expand_grid(
 LOT_SQFT = mean(WestRoxbury$LOT_SQFT),
 FLOORS = c(1,2,3),
 HALF_BATH = c(1,2,3))
head(house_new_data)
## # A tibble: 6 × 3
##   LOT_SQFT FLOORS HALF_BATH
##      <dbl>  <dbl>     <dbl>
## 1    6278.      1         1
## 2    6278.      1         2
## 3    6278.      1         3
## 4    6278.      2         1
## 5    6278.      2         2
## 6    6278.      2         3
predicted_house <- augment(
 house_model,
 newdata = house_new_data,
 se_fit = TRUE
 )
head(predicted_house)
## # A tibble: 6 × 5
##   LOT_SQFT FLOORS HALF_BATH .fitted .se.fit
##      <dbl>  <dbl>     <dbl>   <dbl>   <dbl>
## 1    6278.      1         1    344.    2.02
## 2    6278.      1         2    373.    3.39
## 3    6278.      1         3    402.    5.04
## 4    6278.      2         1    432.    1.21
## 5    6278.      2         2    461.    2.54
## 6    6278.      2         3    490.    4.24
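The `.se.fit` column can be turned into an approximate 95% band as fitted value plus or minus 1.96 standard errors. The sketch below shows that arithmetic in base R on a made-up linear model; `x`, `y`, `fit`, and `band` are hypothetical names used only for illustration, and the normal-quantile multiplier 1.96 is an approximation to the t-based interval.

```r
# Made-up data for illustration only (not the WestRoxbury data)
set.seed(42)
x <- 1:30
y <- 3 + 2 * x + rnorm(30)

fit <- lm(y ~ x)

# Predict at a few new x values and keep the standard error of each fit
new_x <- data.frame(x = c(5, 15, 25))
pred  <- predict(fit, newdata = new_x, se.fit = TRUE)

# Approximate 95% band: fitted value +/- 1.96 standard errors
band <- data.frame(
  x      = new_x$x,
  fitted = pred$fit,
  lower  = pred$fit - 1.96 * pred$se.fit,
  upper  = pred$fit + 1.96 * pred$se.fit
)
band
```

The `lower` and `upper` columns play the same role as the `ymin` and `ymax` of a ribbon drawn around the fitted line.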

Plot of fitted values for each row

ggplot(predicted_house,
       aes(x = FLOORS, y = .fitted)) +
  # Approximate 95% band: fitted value +/- 1.96 standard errors
  geom_ribbon(aes(ymin = .fitted - 1.96 * .se.fit,
                  ymax = .fitted + 1.96 * .se.fit,
                  fill = HALF_BATH),
              alpha = 0.5) +
  geom_line(aes(color = HALF_BATH), size = 1) +
  # "none" replaces the deprecated FALSE for hiding legends
  guides(fill = "none", color = "none") +
  facet_wrap(vars(HALF_BATH)) +
  theme_minimal() +
  labs(title = "Plot of fitted values for each row",
       subtitle = "Predicted total value by FLOORS, faceted by HALF_BATH, at mean LOT_SQFT",
       caption = "Data source: reisanar/datasets") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))

REPORT

What were the original charts you planned to create for this assignment? What steps were necessary for cleaning and preparing the data?

I used three datasets for this mini project. For the first, the interactive visualization, I wanted to identify differences in the shot power, passing, tackling, and potential of different players. Since the data has no age-group variable, I used age to bin the players into four groups and used those groups in my analysis.

For the second dataset, since it was a spatial visualization, I used the data as provided. For the third dataset, I renamed the variables to remove the spaces in their names.

What story could you tell with your plots? What difficulties did you encounter while creating the visualizations? What additional approaches do you think can be used to explore the data you selected?

From the first interactive plot:

  1. I could say that shot_power and long_passing are highly correlated, which makes sense since passing the ball long requires shot power.
  2. I could also say that, although standing_tackle and potential are correlated to some degree, enough data points depart from the trend that the correlation cannot be called strong.
  3. Also, another finding is that Age has very effect on power, passing, Tackle and potential of players.
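One way to check these visual impressions numerically would be to compute the Pearson correlation within each age group. The sketch below uses a small made-up data frame, `toy`, whose column names mirror `fifa_group`; the values are hypothetical and are not results from the fifa18 data.

```r
# Made-up values for illustration only (not from the fifa18 data)
toy <- data.frame(
  age_group    = rep(c("young", "prime"), each = 4),
  shot_power   = c(40, 50, 60, 70, 45, 55, 65, 75),
  long_passing = c(35, 48, 55, 69, 60, 50, 66, 58)
)

# Split the rows by age group and compute one correlation per group
cor_by_group <- sapply(
  split(toy, toy$age_group),
  function(d) cor(d$shot_power, d$long_passing)
)
cor_by_group
```

Running the same call on `fifa_group` instead of `toy` would give one correlation per age group, which could then be compared across groups.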

From the Spatial Visualization plot:

  1. We could say that, although the dataset does not contain the outline of Florida, plotting all the lakes gives a rough shape of the state. This suggests that the coast of Florida has a lot of lakes, so many that they trace the shape of the coast.
  2. Also, after visualizing the lakes in Polk County, we could see how Lakeland got its name.
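The comparison between counties could also be made with a simple count of lakes per county. The sketch below uses a made-up vector of county names, `toy_counties`, purely for illustration; with the real data the `COUNTY` column of `florida_shapes` would take its place.

```r
# Made-up county labels for illustration only (not the Florida Lakes data)
toy_counties <- c("POLK", "ORANGE", "POLK", "ESCAMBIA", "POLK", "ORANGE")

# Tabulate lakes per county and sort from most to fewest
lake_counts <- sort(table(toy_counties), decreasing = TRUE)
lake_counts
```

A ranking like this would show directly where Polk County stands relative to other counties, rather than relying on the map alone.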

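From the visualization of the model, we could see that:

  1. The number of FLOORS and HALF_BATH in a home are highly significant in predicting the price of the home, in that order of coefficient size (88.2 and 29.0).
  2. The coefficient of LOT_SQFT is close to zero (0.0184), but that reflects its unit (change in value per additional square foot), not low significance: its confidence interval excludes zero and its test statistic (53.7) is the largest of the three.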
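Because a raw coefficient depends on the unit of its predictor, coefficient sizes are easier to compare after the predictors are put on a common scale. The sketch below fits the same model twice on simulated data, once with raw predictors and once with predictors standardized by `scale()`. All variable names and numbers here are made up for illustration and are not the WestRoxbury results.

```r
# Simulated data for illustration only (not the WestRoxbury data)
set.seed(1)
n        <- 200
lot_sqft <- runif(n, 3000, 12000)
floors   <- sample(1:3, n, replace = TRUE)
value    <- 100 + 0.02 * lot_sqft + 80 * floors + rnorm(n, sd = 20)

# Raw predictors: the lot-size coefficient is tiny because it is per square foot
raw_fit <- lm(value ~ lot_sqft + floors)

# Standardized predictors: each coefficient is per one standard deviation
std_fit <- lm(value ~ scale(lot_sqft) + scale(floors))

coef(raw_fit)
coef(std_fit)
```

In the raw fit the lot-size coefficient looks negligible next to the floors coefficient, while in the standardized fit the two are of comparable size, so a small raw coefficient alone does not indicate a weak predictor.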

How did you apply the principles of data visualizations and design for this assignment?

  1. I made sure that appropriate color palettes were used so the graphs represent the data accurately.
  2. All graphs have titles and captions that explain what the graph or chart shows.
  3. The size of each chart was adjusted to include all of its elements.
  4. Legend positions were adjusted so that the scales are clearly visible.